ABSTRACT

The “Collective Intelligence” (COIN) framework concerns 
the design of collectives of agents so that as those agents 
strive to maximize their individual utility functions, their 
interaction causes a provided “world” utility function con- 
cerning the entire collective to be also maximized. Here 
we show how to extend that framework to scenarios hav- 
ing Markovian dynamics when no re-evolution of the sys- 
tem from counter-factual initial conditions (an often expen- 
sive calculation) is permitted. Our approach transforms 
the (time-extended) argument of each agent’s utility func- 
tion before evaluating that function. This transformation 
has benefits in scenarios not involving Markovian dynam- 
ics, in particular scenarios where not all of the arguments 
of an agent’s utility function are observable. We investigate 
this transformation in simulations involving both linear and 
quadratic (nonlinear) dynamics. In addition, we find that a
certain subset of these transformations, which result in
utilities that have low “opacity” (analogous to having high signal
to noise) but are not “factored” (analogous to not being
incentive compatible), reliably improve performance
over that arising with factored utilities. We also present a 
Taylor Series method for the fully general nonlinear case. 

1. INTRODUCTION 
1.1 Background 

In this paper we are concerned with large distributed col- 
lectives of interacting goal-driven computational processes, 
where there is a provided ‘world utility’ function that rates 
the possible behaviors of that collective [29, 27]. We are 
particularly concerned with such collectives where the indi- 
vidual computational processes use machine learning tech- 
niques (e.g., Reinforcement Learning (RL) [14, 20, 19, 23]) 
to try to achieve their individual goals. We represent those 
goals of the individual processes as maximizing an associated
‘payoff’ utility function, one that in general can differ
from the world utility. 

In such a system, we are confronted with the following 
inverse problem: How should one initialize/update the pay- 
off utility functions of the individual processes so that the 
ensuing behavior of the entire collective achieves large values
of the provided world utility? In particular, since in
truly large systems detailed modeling of the system is usu- 
ally impossible, how can we avoid such modeling? Can we 
instead leverage the simple assumption that our learning
algorithms are individually fairly good at what they do to 
achieve a large world utility value? 

This problem is related to work in many other fields, in- 
cluding multi-agent systems (MAS’s), computational eco- 
nomics, mechanism design, reinforcement learning, statis- 
tical mechanics, computational ecologies, (partially observ- 
able) Markov decision processes and game theory. However 
none of these fields is both applicable in large problems, and 
directly addresses the general inverse problem, rather than 
a special instance of it. (See [27] for a detailed discussion of 
the relationship between these fields, involving hundreds of 
references.) For example, the field of mechanism design is 
not generally applicable, being largely tailored to collectives 
of human beings, and in particular to the idiosyncrasy of
such collectives that their members have hidden variables 
whose values they “do not want to reveal”. There is other 
previous work that does consider the general inverse prob- 
lem, and even has each individual computational process 
(or “agent”) use reinforcement learning [2, 7, 10, 15, 16]. 
However, in that work in general each process has the world 
utility function as its payoff utility function (i.e., implements 
a “team game” or an “exact potential game” [8]). Unfor- 
tunately, as expounded below and in previous work, this 
approach scales extremely poorly to large problems.
(Intuitively, the difficulty is that each agent can have a hard
time discerning the echo of its behavior on the world utility
when the system is large; each agent has a horrible
“signal-to-noise” problem.)

Intuitively, we are concerned with payoff utility functions 
that are “aligned” with the world utility, in that modifi- 
cations a player might make that would improve its payoff 
utility also must improve world utility. 1 Fortunately the 

1 Such alignment can be viewed as an extension of the concept
of incentive compatibility in mechanism design [9] to



equivalence class of such payoff utilities extends well be- 
yond team-game utilities. In particular, in previous work 
we used the Collective Intelligence (COIN) framework to
derive the ‘Wonderful Life Utility’ (WLU) payoff function 
[27] as an alternative to a team-game payoff utility. The 
WLU is aligned with world utility, as desired. In addition 
though, WLU overcomes much of the signal-to-noise prob- 
lem of team game utilities [22, 29, 27, 31]. 

As an example, in some of our previous work we used 
the WLU for distributed control of network packet rout- 
ing [29]. Conventional approaches to packet routing have 
each router run a shortest path algorithm (SPA). Unlike 
with a WLU-based collective, in SPA-based routing there 
is no concern for deleterious side-effects of routing decisions 
on the global goal (e.g., no concern for bottlenecks). We 
ran simulations demonstrating that a WLU-based collective
has substantially better throughputs than the best possible 
SPA-based system [29], even though that SPA-based system 
had information denied the COIN system. 

As another example, in [30] we considered the pared-down 
problem domain of a congestion game, in particular a more 
challenging variant of Arthur’s El Farol bar attendance problem
[1], sometimes also known as the “minority game” [6].

In this problem the individual processes making up the col- 
lective are explicitly viewed as ‘players’ involved in a non- 
cooperative game. Each player has to determine which night 
in the week to attend a bar. The problem is set up so that if 
either too few people attend (boring evening) or too many 
people attend (crowded evening), the total enjoyment of the 
attending players drops. Our goal is to design the payoff 
functions of the players so that the total enjoyment across 
all nights is maximized. In this previous work we showed 
that use of the WLU can result in performance orders of 
magnitude superior to that of team game utilities. 

1.2 The Contribution of This Paper 

In this paper we extend this previous work with an ap- 
proach based on Transforming Arguments of Utility func- 
tions (TAU) before the evaluation of those functions. The 
TAU process was originally designed to be applied to the
individual utility functions of the agents in systems in which
the world utility depends on the final state in an episode 
of variables outside the collective that undergo Markovian 
dynamics, with the update rule of those variables reflecting 
the state of the agents at the beginning of the episode. This 
is a very common scenario, obtaining whenever the agents in 
the collective act as control signals perturbing the evolution 
of a Markovian system. 

In the previous version of the COIN framework, to achieve 
good signal-to-noise for such scenarios might require re-evolving 
the system from counter-factual initial states of the agents to 
evaluate each agent’s reward for a particular episode. This 
can be computationally expensive. With TAU utility functions
no such re-evolving is needed; the observed history of
the system in the episode is transformed in a relatively cheap 
calculation, and then the utility function is evaluated with 
that transformed history rather than the actual one. 

The TAU process has other advantages that apply even in 
scenarios not involving Markovian dynamics. In particular 
it allows us to employ the COIN framework even when not 
all arguments of the original utility function are observable, 
due for example to communication limitations. In addition, 

non-human agents, off-equilibrium behavior, etc.


certain types of TAU transformations result in utility functions
that are not exactly aligned with the world utility,
but have so much better signal-to-noise that the collective 
performs better when agents use those transformed utility 
functions than it does with exactly aligned utility functions. 

In this paper computational experiments based on linear 
and quadratic (nonlinear) update rules for the Markovian 
system are presented that verify the foregoing. In particu- 
lar, in these experiments, we consider systems of 50 agents 
using a variety of world utilities and Markovian update rules. 
We compare the performance of using TAU utilities for the 
agents for linear and quadratic dynamics versus the performance
using the corresponding team game utilities. We
also investigate systems having limited observability. In
these cases, TAU utilities robustly outperform even team
game utilities that have full observability.
We also find that the non-aligned,
high signal-to-noise utilities consistently outperform their 
factored counterparts. We end with results using a Taylor 
Series method to address the more general nonlinear case 
than the quadratic one investigated here. 

2. THE MATHEMATICS OF COLLECTIVE 
INTELLIGENCE 

We view the individual agents in the collective as players 
involved in a repeated game. 2 Let $Z$ with elements $\zeta$ be the
space of possible joint moves of all players in the collective
in some stage. We wish to search for the $\zeta$ that maximizes
a provided world utility $G(\zeta)$. In addition to $G$ we are
concerned with utility functions $\{g_\eta\}$, one such function for
each variable/player $\eta$. We use the notation $\hat{\eta}$ to refer to all
players other than $\eta$.

2.1 Intelligence and the central equation 

We wish to “standardize” utility functions so that the 
numeric value they assign to a $\zeta$ only reflects their ranking
of $\zeta$ relative to certain other elements of $Z$. We call such
a standardization of an arbitrary utility $U$ for player $\eta$ the
“intelligence for $\eta$ at $\zeta$ with respect to $U$”. Here we will
use intelligences that are equivalent to percentiles:

$$\epsilon_U(\zeta : \eta) \equiv \int d\mu_{\zeta_{\hat{\eta}}}(\zeta')\, \Theta[U(\zeta) - U(\zeta')], \tag{1}$$

where the Heaviside function $\Theta$ is defined to equal 1 when
its argument is greater than or equal to 0, and to equal
0 otherwise, and where the subscript on the (normalized)
measure $d\mu$ indicates it is restricted to $\zeta'$ sharing the same
non-$\eta$ components as $\zeta$. In general, the measure must reflect
the type of system at hand, e.g., whether $Z$ is countable
or not, and if not, what coordinate system is being used.
Other than that, any convenient choice of measure may be
used and the theorems will still hold. Intelligence values are
always between 0 and 1.
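
As a minimal sketch of Equation 1 for a finite move set, the percentile intelligence can be computed by counting the alternative moves of one player that do no better than its actual move. The uniform measure over moves, the helper names `intelligence` and `toy_utility`, and the toy utility itself are assumptions made only for illustration; they are not taken from the paper.

```python
def intelligence(utility, zeta, eta, moves):
    """Percentile intelligence of player `eta` at joint move `zeta`.

    Assumes a uniform, normalized measure over `moves` (an illustrative
    choice; any convenient measure could be substituted).
    """
    actual = utility(zeta)
    count = 0
    for alt in moves:
        # Hold the other players' moves fixed and vary only eta's move.
        zeta_alt = zeta[:eta] + (alt,) + zeta[eta + 1:]
        # Heaviside function: 1 when its argument is >= 0.
        if actual - utility(zeta_alt) >= 0:
            count += 1
    return count / len(moves)


def toy_utility(zeta):
    # Hypothetical utility used only for this example: the sum of all moves.
    return float(sum(zeta))


moves = (0, 1, 2)
best = intelligence(toy_utility, (2, 0), 0, moves)
middle = intelligence(toy_utility, (1, 0), 0, moves)
worst = intelligence(toy_utility, (0, 0), 0, moves)
```

With three distinct moves the best, middle, and worst choices receive intelligences of 1, 2/3, and 1/3 under this measure.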

Our uncertainty concerning the behavior of the system is 
reflected in a probability distribution over Z. Our ability 
to control the system consists of setting the value of some 
characteristic of the collective, e.g., setting the functions of 
the players. Indicating that value by s, our analysis revolves 

2 The full mathematics of the COIN framework, however, 
extends significantly beyond what is needed to address such 
games. See [28]. 



around the following central equation for $P(G \mid s)$, which
follows from Bayes’ theorem: 

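$$P(G \mid s) = \int d\epsilon_G\, P(G \mid \epsilon_G, s) \int d\epsilon_g\, P(\epsilon_G \mid \epsilon_g, s)\, P(\epsilon_g \mid s), \tag{2}$$

where $\epsilon_g \equiv (\epsilon_{g_{\eta_1}}(\zeta : \eta_1), \epsilon_{g_{\eta_2}}(\zeta : \eta_2), \cdots)$ is the vector of the
intelligences of the players with respect to their associated
functions, and $\epsilon_G \equiv (\epsilon_G(\zeta : \eta_1), \epsilon_G(\zeta : \eta_2), \cdots)$ is the vector
of the intelligences of the players with respect to $G$.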

Note that $\epsilon_{g_\eta}(\zeta : \eta) = 1$ means that player $\eta$ is fully
rational at $\zeta$, in that its move maximizes its utility, given
the moves of the other players. In other words, a point $\zeta$ where
$\epsilon_{g_\eta}(\zeta : \eta) = 1$ for all players $\eta$ is one that meets the
definition of a game-theory Nash equilibrium [9]. Note that
consideration of points $\zeta$ at which not all intelligences equal
1 provides the basis for a model-independent formalization
of bounded rationality game theory, a formalization that
contains variants of many of the theorems of conventional
full-rationality game theory [25]. On the other hand, a $\zeta$ at
which all components of $\epsilon_G = 1$ is a local maximum of $G$
(or more precisely, a critical point of the $G(\zeta)$ surface).

If we can choose $s$ so that the third conditional probability
in the integrand is peaked around vectors $\epsilon_g$ all of whose
components are close to 1, then we have likely induced large
intelligences. If in addition the second term is peaked about
$\epsilon_G$ equal to $\epsilon_g$, then $\epsilon_G$ will also be large. Finally, if the
first term is peaked about high $G$ when $\epsilon_G$ is large, then our
choice of $s$ will likely result in high $G$, as desired.

Intuitively, the requirement that the utility functions have 
high “signal-to-noise” (an issue not considered in conventional
work in mechanism design) arises in the third term.
It is in the second term that the requirement that the utility
functions be “aligned with $G$” arises. In this work we
concentrate on these two terms, and show how to simultaneously
set them to have the desired form.

Details of the stochastic environment in which the collective
operates, together with details of the learning algorithms
of the players, are reflected in the distribution $P(\zeta)$
which underlies the distributions appearing in Equation 2.
Note though that independent of these considerations, our
desired form for the second term in Equation 2 is assured
if we have chosen utilities such that $\epsilon_g$ equals $\epsilon_G$
exactly for all $\zeta$. We call such a system factored. In
game-theory language, the Nash equilibria of a factored collective
are local maxima of $G$. In addition to this desirable equilibrium
behavior, factored collectives automatically provide
appropriate off-equilibrium incentives to the players (an issue
rarely considered in game theory / mechanism design).

2.2 Opacity 

We now focus on algorithms based on utility functions
$\{g_\eta\}$ that optimize the signal/noise ratio reflected in the
third term, subject to the requirement that the system be
factored. To understand how these algorithms work, given
a measure $d\mu(\zeta_\eta)$, define the opacity at $\zeta$ of utility $U$ as:

$$\Omega_U(\zeta : \eta, s) \equiv \int d\zeta'\, J(\zeta' \mid \zeta)\, \frac{|U(\zeta) - U(\zeta'_{\hat{\eta}}, \zeta_\eta)|}{|U(\zeta) - U(\zeta_{\hat{\eta}}, \zeta'_\eta)|} \tag{3}$$


where $J$ is defined in terms of the underlying probability
distributions, 3 and $(\zeta'_{\hat{\eta}}, \zeta_\eta)$ is defined as the worldline whose
$\hat{\eta}$ components are the same as those of $\zeta'$ while its $\eta$
components are the same as those of $\zeta$ ([28]).

The denominator absolute value in the integrand in Equation 3
reflects how sensitive $U(\zeta)$ is to changing $\zeta_\eta$. In
contrast, the numerator absolute value reflects how sensitive
$U(\zeta)$ is to changing $\zeta_{\hat{\eta}}$. So the smaller the opacity of a
utility function $g_\eta$, the more $g_\eta(\zeta)$ depends only on the move of
player $\eta$, i.e., the better the associated signal-to-noise ratio
for $\eta$. Intuitively then, lower opacity should mean it is easier
for $\eta$ to achieve a large value of its intelligence.

To formally establish this, we use the same measure $d\mu$
to define opacity as the one that defined intelligence. Under
this choice expected opacity bounds how close to 1 expected
intelligence can be [28]:

$$E(\epsilon_U(\zeta : \eta) \mid s) \le 1 - K, \quad \text{where} \quad K \le E(\Omega_U(\zeta : \eta, s) \mid s). \tag{5}$$

So low expected opacity of utility $g_\eta$ ensures that a necessary
condition is met for the third term in Equation 2 to have the
desired form for player $\eta$. While low opacity is not, formally
speaking, also sufficient for $E(\epsilon_U(\zeta : \eta) \mid s)$ to be close to 1,
in practice the bounds in Equation 5 are usually tight.

2.3 Difference Utilities 

It is possible to solve for the set of all utilities that are 
factored with respect to a particular world utility. Unfortu- 
nately, in general it is not possible for a collective both to 
be factored and to have zero opacity for all of its players. 
However consider difference utilities, which are of the form

$$U(\zeta) = G(\zeta) - \Gamma(f(\zeta)) \tag{6}$$

where $\Gamma(f)$ is independent of $\zeta_\eta$. Any difference utility is
factored [26], and under benign approximations, $E(\Omega_U \mid s)$
is minimized over the set of such utilities by choosing

$$\Gamma(f(\zeta)) = E(G \mid \zeta_{\hat{\eta}}, s), \tag{7}$$

up to an overall additive constant. We call the resultant
difference utility the Aristocrat utility (AU), loosely
reflecting the fact that it measures the difference between a

If possible, we would like each player $\eta$ to use the associated
AU as its utility function to ensure good form for both
terms 2 and 3 in Equation 2. This is not always feasible how- 
ever. The problem is that to evaluate the expectation value 
defining its AU each player needs to evaluate the current 
probabilities of each of its potential moves. However if the 
player then changes its utility function to be the associated 
AU it will in general substantially change its ensuing behav- 
ior. (The player now wants to choose moves that maximize 
a different function from the one it was maximizing before.) 
In other words, it will change the probabilities of its moves, 
which means that its new utility function is in fact not the 
AU for its actual (new) probabilities. 

There are ways around this self-consistency problem, but 
in practice it is often easier to bypass the entire issue, by 

C‘77,5), with: 


^(Ct7>C I C>?>S) = 


P{Qr } \£^s)P{£ t v \£^s)p{£ t v ) 

2 + 

P(C; 1 MP(Cn ] CsMCv) 


(4) 


3 Writing it out in full, J((' | C) = 1 Cn.^/P^ I 
giving each $\eta$ a utility function that does not depend on the
probabilities of $\eta$’s own moves. One such utility function is
the Wonderful Life Utility (WLU). The WLU for player $\eta$
is parameterized by a pre-fixed clamping parameter $CL_\eta$
chosen from among $\eta$’s possible moves:

$$WLU_\eta \equiv G(\zeta) - G(\zeta_{\hat{\eta}}, CL_\eta). \tag{8}$$

WLU is factored regardless of the choice of clamping parameter.
Furthermore, while not matching AU’s low opacity,
WLU usually has far better opacity than does a team game.
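
The factoredness of WLU can be checked by brute force on a small example. The sketch below assumes a hypothetical three-player world utility with binary moves and a clamping value of 0; `world_utility`, `wlu`, and `is_factored` are illustrative names, not part of the COIN framework's published code.

```python
import itertools


def world_utility(zeta):
    # Hypothetical G: prefers joint moves whose total is exactly 2.
    return -(sum(zeta) - 2) ** 2


def wlu(G, zeta, eta, clamp):
    """Wonderful Life Utility: G(zeta) minus G with eta's move clamped."""
    clamped = zeta[:eta] + (clamp,) + zeta[eta + 1:]
    return G(zeta) - G(clamped)


def is_factored(G, utility, n_players, moves):
    """True if every unilateral move change shifts `utility` and G the same way."""
    for zeta in itertools.product(moves, repeat=n_players):
        for eta in range(n_players):
            for alt in moves:
                other = zeta[:eta] + (alt,) + zeta[eta + 1:]
                dG = G(other) - G(zeta)
                dU = utility(other, eta) - utility(zeta, eta)
                if (dG > 0) != (dU > 0) or (dG < 0) != (dU < 0):
                    return False
    return True


factored = is_factored(
    world_utility,
    lambda zeta, eta: wlu(world_utility, zeta, eta, 0),
    n_players=3,
    moves=(0, 1),
)
example_value = wlu(world_utility, (1, 1, 1), 0, 0)
```

The check succeeds because the clamped term does not depend on the move of the player being evaluated, so a unilateral change alters WLU and $G$ by the same amount.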


3. THE COIN FRAMEWORK FOR SYSTEMS 
WITH MARKOVIAN EVOLUTION 

We consider games which consist of multi-step “episodes” . 
Within each episode the entire system evolves in a Marko- 
vian manner from the initial moves of the players. We are 
interested in such games where some of the players $\eta$ are
not agents whose initial state is under control of a learning
algorithm that we control, but rather constitute an “envi- 
ronment” for those controllable agents (i.e., where some of 
the players correspond to the state of nature). 

Let $A$ be the Markovian single step evolution operator of
the entire system through an episode,

$$\zeta_{t+1} = A\,\zeta_t. \tag{9}$$

Each component $\zeta_{t,\eta}$, for example, could be a one-dimensional
real number. The row vector $A_\eta$ would then be $\eta$’s update
rule. Alternatively, each agent could be represented by one
of $N$ symbolic values. In that case, $\zeta_t$ would be given in a

W i7?I n n Moor Ko_ 
sis). Considering such large spaces is necessary to describe
arbitrary, nonlinear dynamics as Markovian evolution. Here 
we will concentrate on the former case, where the moves of 
the players are all real numbers. 

The full multiple time step evolution of an episode is given
by the single step operator in the usual way: Let

$$C \equiv \begin{bmatrix} A \\ A^2 \\ \vdots \\ A^T \end{bmatrix}$$

where $T$ is the number of time steps per episode. This operator
applied to our initial state $\zeta_0$ yields the entire “worldline”
$\zeta$, or time history, of the system.

$$\zeta = C\,\zeta_0. \tag{10}$$


It is important to note that the particular form of C given 
above is not necessary for the results and methods of this 
paper to apply. In fact, there is no reason even to view the 
COIN-based choice of the $g_\eta$ as optimizing $G$ for a multi-step
game involving a “dynamics” process in some sense. It
can be viewed as simply optimizing some $G(C\zeta_0)$ for some
“abstract” function $C$. As we will see, a major advantage of
our approach to optimizing functions of this form is that $C$
only needs to be run once to set the values of $g_\eta$. Moreover,
this single running of C is automatically done “by nature” as 
the system runs. There is no extra burden on the individual 
agents to perform calculations involving G, for example, to 
evaluate outcomes of counter-factual moves. 


We consider difference utility functions of the form

$$g_\eta(\zeta) = G(C\zeta_0) - \Gamma_\eta(F_\eta C \zeta_0) \tag{11}$$

where $G$ is the world utility function to be optimized. We
will choose $F_\eta$ so that the product $F_\eta C \zeta_0$ is independent of
agent $\eta$’s actions. This is a necessary and sufficient condition
for the associated difference utility $g_\eta(\zeta)$ to be factored with
respect to the world utility $G$ for any and all choices of
$\Gamma_\eta$. In general, $\Gamma_\eta$ can be chosen in such a way to optimize
learnability. Here though, for simplicity, we choose $\Gamma_\eta = G$.
Accordingly, application of the $F_\eta$ operator is an instance
of transforming the argument of the (second term of the)
utility functions of the agents, i.e., it is a TAU process.


4. LINEAR EVOLUTION

4.1 Avoiding re-evolution of the system

We now consider the operator $F_\eta$ for the case of linear
evolution (i.e. $C$ is a linear operator). For simplicity episodes
are composed of one time step (i.e. $C = A$), and agents
initially exist in one of two states (i.e., the players under our
control can make one of two moves). As mentioned above, a
sufficient condition for $\eta$’s difference utility $g_\eta$ to be factored
is that the combination $F_\eta C \zeta_0$ is independent of $\eta$’s initial
action. One way to accomplish this starts by clamping $\eta$’s
initial action, producing $CL_\eta \zeta_0$, where $CL_\eta$ is a clamping
operator represented by a decimated identity matrix with
zero-valued diagonal element at position $\eta$. This clamped
state must then be re-evolved to produce the desired
combination, $C(CL_\eta \zeta_0)$.

Unfortunately doing this means re-evolving the entire system,
which may be computationally prohibitive, especially if
it must be done for each agent. We define, therefore, a
post-evolution clamping operator $F_\eta$ such that $F_\eta C = C(CL_\eta)$,
and therefore no re-evolving is needed once $C\zeta_0$ has been
evaluated (by nature). It follows that

$$F_\eta = C\,(CL_\eta)\,C^{-1}. \tag{12}$$

The spectral structure of the operator $F_\eta$ is readily determined.
The eigenvalues are $\lambda_k = 1 - \delta_{k\eta}$ where $\delta_{k\eta}$ is the
Kronecker delta. Corresponding eigenvectors are $e_k = c_k$
where $\{c_k\}$ are the columns of the linear evolution operator
$C$. Since they span the space the post-evolved state can be
expanded in terms of these eigenvectors of $F_\eta$:

$$\zeta_t = \sum_k \alpha_k\, e_k \tag{13}$$

Application of $F_\eta$ to the post-evolved state in this basis is
straightforward. The result is $F_\eta \zeta_t = \zeta_t - \alpha_\eta e_\eta$ where $\alpha_\eta$ is
the projection of $\zeta_t$ in the direction of $e_\eta$. Furthermore, since
eigenvectors of $F_\eta$ correspond to columns of $C$, the matrix
$C^{-1}$ acts as a projector onto this basis. Using this fact and
recalling that $\zeta_t = C\zeta_0$, it can be shown that $\alpha_\eta = \zeta_{0,\eta}$, i.e. it
equals agent $\eta$’s action at $t = 0$. Thus, $F_\eta \zeta_t$ can be completely
expressed in terms of observed post-evolution quantities:

$$F_\eta \zeta_t = \zeta_t - \zeta_{0,\eta}\, c_\eta \tag{14}$$

In this way we can calculate the result of clamping the initial
state and re-evolving without performing that re-evolution.
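
The equivalence between re-evolving a clamped initial state and subtracting a single column after the fact can be checked numerically. The following is a sketch under stated assumptions: a random invertible 5 x 5 matrix stands in for the evolution operator, initial actions are taken to be ±1, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5      # number of components (illustrative size)
eta = 2    # index of the agent whose action is clamped

C = rng.normal(size=(n, n))               # random linear evolution operator
zeta0 = rng.choice([-1.0, 1.0], size=n)   # two possible initial actions per agent

CL = np.eye(n)
CL[eta, eta] = 0.0                        # decimated identity: zero at position eta

zeta_t = C @ zeta0                        # evolution performed once, "by nature"
re_evolved = C @ (CL @ zeta0)             # counterfactual re-evolution (the costly route)

F = C @ CL @ np.linalg.inv(C)             # post-evolution clamping operator
via_operator = F @ zeta_t
via_shortcut = zeta_t - zeta0[eta] * C[:, eta]   # subtract one column of C

# Spectrum of F: one zero eigenvalue (for column eta) and ones elsewhere.
eigenvalues = np.sort(np.linalg.eigvals(F).real)
```

Only the observed post-evolution state, agent $\eta$'s own initial action, and one column of $C$ enter the shortcut, which is what makes it cheap to evaluate for every agent.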




Figure 1: Performance for 50 agents with linear dynamics
when the environment is set to zero at the
beginning of each episode. Results for TAU $g$ are
represented by +, results for 75% observability TAU,
$g^{75\%}$, are *; then applying $L$ to the first as well as
second terms gives the utility $g_{nf}^{75\%}$ with results
depicted as □. $g^{25\%}$ is △, and finally, $G$, the
team game, is ○. Errorbars are too small to see.





Figure 2: System performance for 50 agents with 
linear dynamics in a random environment. Key is 
same as Figure 1. The small degradation in performance
is due to randomness from the environment.


4.2 Observability restrictions 

In practice, the full worldline of the system may not be 
fully observable to each agent. Such limited observability 
of a particular component may be determined by the prob- 
lem. In other cases, due to communication constraints each 
agent is only allowed to observe a certain number of compo- 
nents, and must select which such components to observe, 
for example to optimize some auxiliary quantity like opac- 
ity. Similarly, the dynamics may not be known exactly to 
the agent; some rows of C may be uncertain to an agent, 
or simply cannot be determined. In these kinds of situations
the $g_\eta$ described above cannot be evaluated at the end
of an episode by agent $\eta$, even if the value $G(\zeta_t)$ is globally
broadcast to all agents.

The TAU approach outlined above is well-suited to address
such situations. Formally, a decimated identity operator
$L$ can be defined whose diagonal elements are $\{0, 1\}$
depending on whether or not they are observable. The
corresponding factored utility for agent $\eta$ is

$$g_\eta(\zeta_t) = G(\zeta_t) - G(L F_\eta \zeta_t), \tag{15}$$

where in general $L$ may vary with $\eta$. Given global broadcast
to all agents of the value of $G(\zeta_t)$, for each agent to evaluate
this type of $g_\eta$ only requires that those components of $F_\eta \zeta_t$
that are non-zero (and therefore can vary) after application
of the $L$ operator be observed.

Figure 3: Comparison of average noise for factored
$g^{75\%}$ (upper graph) and nonfactored $g_{nf}^{75\%}$ (lower
graph) utility functions with 75% observability. The
first 100 time steps are the initial training period.

This difference utility has two main sources of noise, one 
from potentially poor choice of the clamping operator, and 
the other from the use of L in the second (subtracted) term 
but not in the first. To address that latter source of noise we 
can impose limited observability on the first term in addition 
to the second one, getting 

$$g_\eta(\zeta_t) = G(L\zeta_t) - G(L F_\eta \zeta_t). \tag{16}$$

The new utility is not factored with respect to $G$. According
to the central equation however, it may still result
in better performance than when we don’t have $L$ in the first
term, if the improvement in opacity more than offsets the
loss of exact factoredness. In addition to the potential for
such far superior opacity, this utility has the added advantage
that now we don’t even need to rely on global broadcast
of $G(L\zeta_t)$ to evaluate $g_\eta$.
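
The two limited-observability utilities can be compared on a small random instance. This sketch assumes a quadratic spin-glass style world utility, a random linear evolution operator, and an observation mask that keeps the first six of eight components; the function `utilities` and all sizes are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta, n_obs = 8, 3, 6

C = rng.normal(size=(n, n))                  # linear evolution operator
J = np.triu(rng.normal(size=(n, n)), k=1)    # couplings J_ij, nonzero only for i < j
L = np.diag((np.arange(n) < n_obs).astype(float))  # first n_obs components observable


def G(z):
    # World utility: sum over i < j of J_ij * z_i * z_j.
    return float(z @ J @ z)


def utilities(zeta0):
    """Return (G, factored utility, non-factored utility) for agent eta."""
    zeta_t = C @ zeta0
    clamped = zeta_t - zeta0[eta] * C[:, eta]     # post-evolution clamping
    g_factored = G(zeta_t) - G(L @ clamped)       # L in the second term only
    g_nonfactored = G(L @ zeta_t) - G(L @ clamped)  # L in both terms
    return G(zeta_t), g_factored, g_nonfactored


zeta0 = rng.choice([-1.0, 1.0], size=n)
flipped = zeta0.copy()
flipped[eta] *= -1.0                              # agent eta changes its action

G_a, gf_a, gnf_a = utilities(zeta0)
G_b, gf_b, gnf_b = utilities(flipped)
```

When agent $\eta$ flips its action, the first utility changes by exactly the same amount as $G$, because its subtracted term does not depend on that action; the second utility needs only observed components but generally does not track $G$ exactly.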

4.3 Experiments 

Numerical simulations were performed with 50 agents. Af- 
ter an initial 100-episode training period, agents selected 
initial actions in each subsequent episode with the same re- 
inforcement learning algorithm used in our previous work. 
All players underwent linear dynamics within each episode.
The world utility function was a spin glass,

$$G(\zeta_t) = \sum_{i<j} J_{ij}\, \zeta_{t,i}\, \zeta_{t,j}. \tag{17}$$

We collected statistics by averaging runs over many randomly
set matrices $A$ and coupling constants $J_{ij}$. These
runs were for systems whose first 25% and 75% components
at the end of the episode are observable, given some canonical
ordering of agents. We considered both the case where
the environment was initialized to zero (Figure 1) and where 
it was initialized randomly (Figure 2). We examined world 
utility value vs. episode number for six utility functions: 

1) TAU $g$ for a fully observable system;

2) TAU $g$ for 75% observability, $g^{75\%}$;

3) The modification $g_{nf}^{75\%}$ giving a non-factored system,
again with 75% observability;

4) $g^{25\%}$ for a factored system with 25% observability;

5) $g_{nf}^{25\%}$ for a non-factored system with 25% observability;

6) The team game, where every $g_\eta = G$.

Figure 4: Comparison of average noise for factored
$g^{25\%}$ (upper graph) and nonfactored $g_{nf}^{25\%}$ (lower
graph) utilities with 25% observability.

Figure 5: System performance for 50 agents executing
quadratic dynamics. The environment has been
initialized to zero. Key is same as in Figure 1.

Even the results for limited observability clearly outperform
the corresponding team game in which there is full
observability. Furthermore, for < 100% observability, the
non-factored utilities (L in both terms) consistently outper- 
form their factored counterpart. In these runs factoredness 
fell to approximately 90%, while noise levels in the utility 
functions were as shown in figures 3 and 4. The improve- 
ment in performance in the early post-training-period times 
due to better signal-to-noise more than outweighs the degra- 
dation due to loss in factoredness. 

5. NONLINEAR EVOLUTION 

Generalizing these results to arbitrary nonlinear dynamics 
requires high dimensional representations. In particular, in 
the case where all agents’ states are binary, the number of
joint states grows as $2^N$ where $N$ is the number of agents.
The successive bits in such a representation can be indicated
as $\{x_i\} \in B \equiv \{-1, 1\}$ where we have $N$ bits altogether.
Alternatively, we can expand the joint state in the basis of
Walsh functions $(1, \{x_i\}, \{x_i x_j\}, \ldots)$ which spans the set
of all functions taking elements of the space $B$ to $B$.

Doing this reduces the original nonlinear dynamics to linear
dynamics, at the price of expanding the size of the space.
As an example, in the case of a quadratic update rule, we
can represent $\zeta_0$ in terms of second order Walsh functions
$\{x_i x_j\}$. Evolution of the system is accomplished by
application of the associated evolution matrix $C$ or $A$, yielding
$\zeta_t = C\zeta_0$. To obtain factored utility functions, analogous
post-evolution operators $F_\eta$ can be constructed. To ensure
that the second term in the difference utility is independent
of $x_\eta$, all terms involving $x_\eta$ will have to be subtracted. In
the quadratic case, $N$ such terms will have to be subtracted
whereas in the linear case there was only one term. We find

$$F_\eta \zeta_t = \zeta_t - \sum_{i=1}^{N} x_i x_\eta\, c_{i,\eta} \tag{18}$$

where $c_{i,\eta}$ is the column of $C$ corresponding to the Walsh
function $x_i x_\eta$, $i = 1, \ldots, N$. Results of experiments for this
case with 50 agents are presented in Figure 5.
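
The lifting of a quadratic update rule to linear dynamics can be sketched as follows. The feature layout used here (the $N$ first-order functions followed by the pairwise products with $i < j$, with the constant function omitted) is an assumption chosen for brevity, as are the helper names `walsh_features` and `columns_involving`; the paper's own indexing of the columns may differ.

```python
import itertools

import numpy as np


def walsh_features(x):
    """First-order terms x_i followed by second-order terms x_i * x_j (i < j)."""
    pairs = [x[i] * x[j] for i, j in itertools.combinations(range(len(x)), 2)]
    return np.concatenate([x, pairs])


def columns_involving(eta, n):
    """Indices of the features that contain the factor x_eta."""
    idx = [eta]
    for k, (i, j) in enumerate(itertools.combinations(range(n), 2)):
        if eta in (i, j):
            idx.append(n + k)
    return idx


rng = np.random.default_rng(2)
n, eta = 4, 1
n_feat = n + n * (n - 1) // 2

C = rng.normal(size=(n, n_feat))        # quadratic rule written as a matrix on the features
x0 = rng.choice([-1.0, 1.0], size=n)

phi0 = walsh_features(x0)
zeta_t = C @ phi0                       # evolution performed once

cols = columns_involving(eta, n)
clamped_shortcut = zeta_t - C[:, cols] @ phi0[cols]   # subtract every term containing x_eta

x_clamped = x0.copy()
x_clamped[eta] = 0.0                    # clamp agent eta, then re-evolve for comparison
clamped_re_evolved = C @ walsh_features(x_clamped)
```

In this layout exactly $N$ columns involve $x_\eta$ (its own first-order term plus $N - 1$ products), so $N$ terms are subtracted, compared with the single column needed in the linear case.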

6. TAYLOR SERIES METHOD 


To address the more general nonlinear problem, we con- 
sider a slightly different framework. In this case, each agent 
is assigned a real-valued number $r_\eta$. The state of the system
$\zeta_t$ is a vector including these numbers as components.
Each agent can choose among three actions which results
in $r_\eta$ being modified by $\{\pm\Delta, 0\}$. Nonlinear update rules
$\zeta_t = c(\zeta_0)$ are functions of these real-valued variables.

Construction of factored utilities

$$g_\eta(\zeta_0) = G(c(\zeta_0)) - G(c(CL\,\zeta_0)) \tag{19}$$

requires that $c(CL\,\zeta_0)$ be independent of $\eta$’s choice of action.
One way to accomplish this is to clamp (apply $CL$ to) $\zeta_0$ and
approximate $c(CL\,\zeta_0)$ with a Taylor Series expansion about
the unclamped $\zeta_0$ initial state.

$$c(CL\,\zeta_0) = c(\zeta_0) + \Delta\,(\zeta_0 - CL\,\zeta_0) \cdot \nabla c(\zeta_0) \tag{20}$$

Varying $\Delta$ provides us a small parameter to control the
expansion. It should be noted that while this method requires
that $c(\zeta)$ be differentiable, the world utility $G$ need not be.
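
The role of the step size as a small parameter can be illustrated with a generic first-order Taylor estimate. The sketch below uses the textbook expansion $c(\zeta_0 + d) \approx c(\zeta_0) + \nabla c(\zeta_0)\, d$ with $d$ the displacement from the unclamped to the clamped state; the quadratic rule, its random coefficients, and the assumption that clamping shifts one component by the step size are all illustrative, and the sign and scaling conventions are not claimed to match those of Equation 20.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta = 4, 0
a = rng.normal(size=(n, n, n))   # hypothetical coefficients: c_k = sum_ij a[k,i,j] z_i z_j


def c(z):
    return np.einsum("kij,i,j->k", a, z, z)


def jacobian(z):
    # Derivative of c_k with respect to z_m for the quadratic rule above.
    return np.einsum("kmj,j->km", a, z) + np.einsum("kim,i->km", a, z)


def clamped_estimate(z0, z_clamped):
    """First-order estimate of c(z_clamped) using only quantities evaluated at z0."""
    return c(z0) + jacobian(z0) @ (z_clamped - z0)


z0 = rng.uniform(-1.0, 1.0, size=n)


def error_for_step(delta):
    z_clamped = z0.copy()
    z_clamped[eta] -= delta      # assumption: clamping undoes a move of size delta
    return np.linalg.norm(c(z_clamped) - clamped_estimate(z0, z_clamped))


err_small = error_for_step(0.01)
err_large = error_for_step(0.1)
```

For a quadratic rule the neglected term is exactly second order in the step, so a tenfold smaller step gives a hundredfold smaller error.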

Figure 6 presents results for a quadratic update rule with 
randomly generated coefficients, $c(\zeta_0) = \sum_{ij} a_{ij}\,\zeta_{0,i}\,\zeta_{0,j}$. The
agents are given a random initial starting point with
$-1 < r_\eta < 1$. Because $c$ is quadratic, $G(\zeta_t)$ is a quartic polynomial
in $N$ dimensions. Since the coefficients $\{a_{ij}\}$ have random
signs, the function G has as many increasing directions as 
decreasing directions. The goal of the system is to traverse 
this high dimensional surface, find an increasing direction, 
and then follow that direction to infinity. 

In light of the central equation we plot the average intel- 
ligences of the agents. For three possible actions, the best 
action has an intelligence of 1 while the worst choice gives 
0.33. A random walk (no learning) gives a value of 0.67 on 
average. Figure 7 shows that a team game has the same 
intelligence as a random walk. The TAU utility g displays 
a much higher intelligence which is also reflected in better 
performance. 
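
The quoted intelligence values follow from the percentile definition, assuming the three actions yield distinct utilities and a random walk picks each with equal probability:

```latex
\epsilon_{\text{best}} = \tfrac{3}{3} = 1, \qquad
\epsilon_{\text{middle}} = \tfrac{2}{3} \approx 0.67, \qquad
\epsilon_{\text{worst}} = \tfrac{1}{3} \approx 0.33,
\qquad
E[\epsilon_{\text{random}}] = \tfrac{1}{3}\left(1 + \tfrac{2}{3} + \tfrac{1}{3}\right) = \tfrac{2}{3} \approx 0.67 .
```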

It is interesting to adjust the ratio of ± signs in the co- 
efficients of the polynomials. If we introduce, for example, 
more negative coefficients than positive, we expect the sur- 
face to preferentially turn down. The task for the agents 
becomes more challenging. We find, in fact, that three of 
the limited observability utilities perform worse over time 
(i.e. their world utility decreases). The team game also
performs worse over time. In fact, not only does the team game
give poor performance, but it fails altogether. The lowest
noise TAU utilities $g$ and $g_{nf}^{75\%}$ still give robust performance.

Figure 6: System performance for N = 50 agents
using the Taylor Series method. The dynamics is
governed by a quadratic function of the agents’
“positions”. The world utility $G$ is a quartic in $N$
dimensions. (Upper two graphs are $g$ and $g_{nf}^{75\%}$; middle
two are $g_{nf}^{25\%}$ and $g^{75\%}$; lower two are $g^{25\%}$ and a team
game $G$.) The initial training period is not shown.

Figure 7: Percentile intelligence for agents using
TAU $g$ (upper graph) versus a team game (lower
graph). For three actions, a random walk (no learning)
would give an average intelligence of 67%.

team game perform worse over time (i.e. their world
utilities decrease).

Figure 9: Percentile intelligence for agents using
TAU $g_\eta$ (upper graph) versus a team game (lower
graph), when the surface preferentially turns down.
The degradation in intelligence as compared to Figure 7
reflects the greater difficulty of the problem.

To further study this dramatic difference in performance, 
we compared the average intelligence of the agents for g and 
the team game. The results are shown in Figure 9. In the
case of the team game, again, there is no appreciable change 
in intelligence from the initial training period to when the 
agents are invoking learning algorithms. Conversely, for the 
g utility, the agents perform at a higher intelligence than 
the team game albeit lower than the situation in Figure 7. 

7. CONCLUSION 

We present a detailed extension of the COIN framework 
to systems that undergo Markovian evolution. We find
consistent, robust improvement of performance as compared to
the corresponding team game. The approach is applied 
to systems with linear and quadratic (nonlinear) update 
rules. Results from numerical simulations are presented. 
This framework also naturally includes the case of limited 
observability. We found that even COIN-based utility func- 
tions constrained by limited observability often outperformed 






conventional team game utilities having full observability.
We also found a new class of nonfactored utilities that con- 
sistently outperformed their factored counterpart, due to 
improved signal-to-noise characteristics. 

To address the general nonlinear case, we developed a 
Taylor Series method. In this case, the system of agents can 
be imagined to traverse an $N$-dimensional surface. We find
that the system’s performance can depend on the character- 
istics of the surface being optimized. We show that in some 
situations a team game will fail altogether (i.e. its perfor- 
mance will degrade over time) while the corresponding TAU 
utility continues to perform well. 